Rechecking pending Pods (conflict resolved) #375
Maybe we should rewrite the reconciler (which is cron-triggered) to use informers instead of retrying while the pod is pending.
That way we could re-queue the pod liveness check for a later time.
Looking at the code, I'm especially worried about the blocking wait until the pod is no longer in the Pending state.
EDIT: another thing that could help is pagination when listing the resources. Of course, this would only make sense if we can correlate the number of returned API results with the OOM kills.
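To make the pagination idea concrete, here is a minimal sketch of consuming a list in bounded chunks. The `page`, `listFunc`, and `forEachPage` names are hypothetical and not part of this PR. They only mirror the limit and continue-token pattern that Kubernetes list calls expose, so memory use is bounded by the page size rather than by the total number of results.

```go
package main

// page is one chunk of results plus an opaque token for the next chunk.
// Hypothetical type, used only for this illustration.
type page struct {
	Items         []string
	ContinueToken string // empty when there are no more pages
}

// listFunc fetches at most limit items, starting at continueToken.
// Hypothetical signature standing in for a paginated API list call.
type listFunc func(limit int, continueToken string) (page, error)

// forEachPage calls handle once per page, so only one page of results
// is held in memory at a time.
func forEachPage(list listFunc, limit int, handle func(items []string) error) error {
	token := ""
	for {
		p, err := list(limit, token)
		if err != nil {
			return err
		}
		if err := handle(p.Items); err != nil {
			return err
		}
		if p.ContinueToken == "" {
			return nil // last page reached
		}
		token = p.ContinueToken
	}
}
```

Whether this helps still depends on the correlation mentioned above: if the OOM kills are not driven by the size of the list responses, paging will not change much.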
@maiqueb can you point out where a blocking wait is happening? AFAIU, there's only a wait of 500ms at a time, and at most three retries, totalling 1.5s of waiting per pod. To me this seems very reasonable even if the issue were to crop up in several pods.
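For readers following the thread, the bounded retry described here can be sketched roughly as below. This is not the code from the PR: `podPhase`, `waitWhilePending`, and the constants are hypothetical names, and the numbers simply restate the 500ms interval and three retries quoted above.

```go
package main

import (
	"errors"
	"time"
)

// podPhase is a stand-in for the pod phase (hypothetical type).
type podPhase string

const (
	phasePending podPhase = "Pending"
	phaseRunning podPhase = "Running"

	// Values quoted in the discussion: three retries, 500ms apart.
	maxRetries    = 3
	retryInterval = 500 * time.Millisecond
)

var errStillPending = errors.New("pod still pending after all retries")

// waitWhilePending re-reads the pod phase up to retries times, sleeping
// interval between reads. It returns the total time spent sleeping.
func waitWhilePending(getPhase func() podPhase, retries int, interval time.Duration) (time.Duration, error) {
	var waited time.Duration
	phase := getPhase()
	for attempt := 0; attempt < retries && phase == phasePending; attempt++ {
		time.Sleep(interval) // blocks the calling goroutine for the whole interval
		waited += interval
		phase = getPhase()
	}
	if phase == phasePending {
		return waited, errStillPending
	}
	return waited, nil
}
```

With the quoted values, the worst case is `maxRetries * retryInterval`, i.e. 1.5s per pending pod, and those waits add up when pods are checked one after another.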
Force-pushed from 91eef1c to 48c03a5
Fixed a unit test with the latest force push (a test that was previously expected to fail should now pass). The change was too small to keep as its own commit, so I squashed it down.
It's a sleep: we block the thread for that duration. What I'm trying to say is that we should re-queue the request and check, on the next iteration, whether the pod we read is still in the Pending state. Is there something preventing this approach?
To summarize some discussion: we have decided to proceed with merging this for now, and to track Miguel's suggested implementation as a separate task. The downside of the current implementation is that having many pending pods at the same time will make the reconcile cycle take a long time. However, the current implementation still fixes pods stuck in the Pending state, and is overall better than not having a fix. To do things the "proper" way, we will need to keep a list of the pending pods in the reconcile looper struct and retry them. This would also need to be integrated with the ip-control-loop to sync retries.
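As a rough illustration of the follow-up described above, the sketch below keeps pending pods in a map on the looper and revisits them on the next cycle instead of sleeping. All names (`podKey`, `looperSketch`, `runCycle`, `maxDeferrals`) are hypothetical, and the integration with the ip-control-loop is deliberately left out.

```go
package main

// podKey identifies a pod (hypothetical type for this sketch).
type podKey struct {
	Namespace string
	Name      string
}

// looperSketch stands in for the reconcile looper struct. It remembers
// how many cycles each pending pod has already been deferred.
type looperSketch struct {
	maxDeferrals int
	pending      map[podKey]int
}

func newLooperSketch(maxDeferrals int) *looperSketch {
	return &looperSketch{maxDeferrals: maxDeferrals, pending: map[podKey]int{}}
}

// runCycle inspects every pod once and never sleeps. Pods that are still
// pending are deferred to the next cycle until maxDeferrals is exceeded.
func (l *looperSketch) runCycle(pods []podKey, isPending func(podKey) bool) (resolved, deferred, exhausted []podKey) {
	for _, pod := range pods {
		if !isPending(pod) {
			delete(l.pending, pod) // no longer pending: stop tracking it
			resolved = append(resolved, pod)
			continue
		}
		l.pending[pod]++
		if l.pending[pod] > l.maxDeferrals {
			delete(l.pending, pod) // gave up waiting for this pod
			exhausted = append(exhausted, pod)
			continue
		}
		deferred = append(deferred, pod)
	}
	return resolved, deferred, exhausted
}
```

The trade-off is that a cycle's duration no longer grows with the number of pending pods, at the cost of carrying state between cycles.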
Thanks for working through this to reach consensus. Glad we'll follow up on the blocking issues Miguel mentioned.
This fix resolves the issue where, after a forceful node reboot, force deleting a pod in a stateful set causes the pod to be recreated and remain indefinitely in the Pending state.
This is a rebase of #195